Coding Best Practice

What we will cover in this session


  • Why this is important/useful
  • Readable code
  • Naming things
  • Organising your projects
  • Version control


Disclaimer - some of it is quite specific to coding in R

Why this is important


It is important that your code is:

  • readable

  • understandable

  • reproducible

This matters most when you are collaborating with others, but it still applies when you are working on a project by yourself.

Resources


DataCamp tutorial


R4DS Sections 2, 4 and 6


tidyverse style guide


Google Style Guides - including R and Python

Readable code


For naming variables and functions there are two common conventions you can use:

  • theFirstOneIsCalledCamelCase

  • the_second_is_called_snake_case


Personal preference but be consistent
Do not use spaces in names
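
A minimal sketch of why spaces in names are awkward, using a made-up data frame (the object and column names here are illustrative only). A name containing a space has to be wrapped in backticks every time it is used, whereas a snake_case name does not.

```r
# Toy data frame; check.names = FALSE keeps the space in the column name
patients <- data.frame(`patient age` = c(55, 27, 93), check.names = FALSE)

# A name with a space must be quoted with backticks every time
mean(patients$`patient age`)

# Renaming to snake_case removes the need for backticks
names(patients) <- "patient_age"
mean(patients$patient_age)
```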

Readable Code


The janitor package in R has useful functions for cleaning variable names

# A tibble: 6 × 5
  `row ID` `Organisation Name` `Patient Age` `LENGTH OF STAY` Death.flag
     <dbl> <chr>                       <dbl>            <dbl>      <dbl>
1        1 Trust1                         55                2          0
2        2 Trust2                         27                1          0
3        3 Trust3                         93               12          0
4        4 Trust4                         45                3          1
5        5 Trust5                         70               11          0
6        6 Trust6                         60                7          0

Readable Code


The janitor package in R has useful functions for cleaning variable names

data <- clean_names(data)

head(data, n = 4)
# A tibble: 4 × 5
  row_id organisation_name patient_age length_of_stay death_flag
   <dbl> <chr>                   <dbl>          <dbl>      <dbl>
1      1 Trust1                     55              2          0
2      2 Trust2                     27              1          0
3      3 Trust3                     93             12          0
4      4 Trust4                     45              3          1

Readable Code


The janitor package in R has useful functions for cleaning variable names

data <- clean_names(data, case = "upper_camel")

head(data, n = 4)
# A tibble: 4 × 5
  RowId OrganisationName PatientAge LengthOfStay DeathFlag
  <dbl> <chr>                 <dbl>        <dbl>     <dbl>
1     1 Trust1                   55            2         0
2     2 Trust2                   27            1         0
3     3 Trust3                   93           12         0
4     4 Trust4                   45            3         1

Readable Code


Use spaces in lines of code

# bad

data$patient_age_grp<-if_else(data$patient_age<55,0,1)

trust1<-data[data$organisation_name=="Trust1",]

# better

data$patient_age_grp <- if_else(data$patient_age < 55, 0, 1)

trust1 <- data[data$organisation_name == "Trust1", ]

Readable Code


Avoid lines that are too long

# bad

ggplot(data) +
  geom_point(aes(x = patient_age, y = length_of_stay, colour = as.factor(death_flag))) +
  theme_minimal() +
  labs(title = "Age and length of stay of patients at 10 hospital trusts", x = "Patient Age (years)", y = "Patient Length of Stay (Days)")

# better

ggplot(data) +
  geom_point(aes(x = patient_age, 
                 y = length_of_stay, 
                 colour = as.factor(death_flag))) +
  theme_minimal() +
  labs(title = "Age and length of stay of patients at 10 hospital trusts", 
       x = "Patient Age (years)", 
       y = "Patient Length of Stay (Days)")

Readable Code


If using the tidyverse or ggplot2 then start a new line after each %>% or +

data %>%
  filter(organisation_name == "Trust1") %>%
  ggplot(aes(x = patient_age, 
             y = length_of_stay, 
             colour = as.factor(death_flag))) +
  geom_point() +
  theme_minimal()

Readable Code


Use functions to avoid repeating lines of code

plot_function <- function(org_name) {

  age_los_plot <- data %>%
    filter(organisation_name == org_name) %>%
    ggplot(aes(x = patient_age,
               y = length_of_stay,
               colour = as.factor(death_flag))) +
    geom_point() +
    theme_minimal() +
    labs(x = "Patient Age (Years)",
         y = "Length of Stay (Days)")

  age_los_plot

}

orgs_list <- list("Trust1", "Trust2", "Trust3")

purrr::map(orgs_list, plot_function)
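
One possible refinement, sketched here with a small made-up data frame: passing the data into the function as an argument, rather than relying on an object called data existing in the global environment, makes the function easier to reuse and test. The function name mean_los_for_org and the toy values are hypothetical, and base R is used so the sketch runs without extra packages.

```r
# Hypothetical toy data using the same column names as the examples above
toy_data <- data.frame(
  organisation_name = c("Trust1", "Trust1", "Trust2", "Trust2"),
  length_of_stay    = c(2, 4, 1, 3)
)

# The data frame is an explicit argument, so the function does not
# depend on a global object called `data`
mean_los_for_org <- function(df, org_name) {
  org_rows <- df[df$organisation_name == org_name, ]
  mean(org_rows$length_of_stay)
}

orgs <- c("Trust1", "Trust2")

# Apply the same function to each organisation instead of copying code
mean_los <- vapply(orgs, mean_los_for_org, numeric(1), df = toy_data)
mean_los
```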

Readable Code

Use comments to annotate your code so it is easier to follow.
Comments are particularly useful for documenting WHY you have done something.

# Anything preceded by a # will not be executed by R

# 10 * 15

10 * 20
[1] 200
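
As a rough illustration of the difference between a comment that says WHAT and one that says WHY, here is a sketch with made-up values. The one-year cut-off is a hypothetical rule invented for this example.

```r
# Toy values, made up for illustration
length_of_stay <- c(2, 1, 12, 3, 400)

# Less helpful: repeats WHAT the code does
# Remove values over 365
los_clean <- length_of_stay[length_of_stay <= 365]

# More helpful: records WHY the step is there
# Stays longer than a year are assumed to be data entry errors
# (hypothetical rule for this example), so they are excluded
los_clean <- length_of_stay[length_of_stay <= 365]
```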

Naming Things


When you are naming new variables choose names that are descriptive.
Do not duplicate names

# bad

data %>%
  group_by(organisation_name) %>%
  summarise(n = sum(death_flag),
            mean_1 = mean(patient_age),
            mean_2 = mean(length_of_stay)) %>%
  head(n = 4)
# A tibble: 4 × 4
  organisation_name     n mean_1 mean_2
  <chr>             <dbl>  <dbl>  <dbl>
1 Trust1                7   55.4   5.07
2 Trust10               4   51.0   4.3 
3 Trust2                5   51.2   4.23
4 Trust3                6   47.9   5.07

Naming Things


When you are naming new variables or functions choose names that are descriptive.
Do not duplicate names

# better

data %>%
  group_by(organisation_name) %>%
  summarise(n_deaths = sum(death_flag),
            mean_patient_age = mean(patient_age),
            mean_length_of_stay = mean(length_of_stay)) %>%
  head(n = 4)
# A tibble: 4 × 4
  organisation_name n_deaths mean_patient_age mean_length_of_stay
  <chr>                <dbl>            <dbl>               <dbl>
1 Trust1                   7             55.4                5.07
2 Trust10                  4             51.0                4.3 
3 Trust2                   5             51.2                4.23
4 Trust3                   6             47.9                5.07

Naming Things


When you are naming new variables or functions choose names that are descriptive.
Do not duplicate names

# bad

model_a <- glm(data$length_of_stay ~ data$patient_age, family = gaussian())

model_b <- glm(as.factor(data$death_flag) ~ data$patient_age, family = binomial())

# better

model_los_age <- glm(data$length_of_stay ~ data$patient_age, 
                     family = gaussian())

model_death_age <- glm(as.factor(data$death_flag) ~ data$patient_age, 
                       family = binomial())

Naming Things


Again, use descriptive names for files. If you are working on a larger project, consider having a separate file for each stage of the project, and make it clear what order the analysis has been done in.

For example:
01_data_cleaning.R
02_baseline_characteristics.R
03_descriptive_stats.R
04_models.R
05_figures.R
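
A small sketch of why the zero-padded numeric prefix is useful: sorting the file names alphabetically then gives the order of the analysis. The vector below simply re-uses the example file names; nothing is read from disk.

```r
# File names from the example above, deliberately out of order
scripts <- c("04_models.R", "01_data_cleaning.R", "05_figures.R",
             "03_descriptive_stats.R", "02_baseline_characteristics.R")

# Zero-padded prefixes mean alphabetical order matches analysis order
ordered_scripts <- sort(scripts)
ordered_scripts

# In a real project the scripts could then be run in sequence, e.g.
# for (script in ordered_scripts) source(script)
```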

Organising Your Work


Within an R script you can use sections to organise your scripts.

Insert a new section using Ctrl + Shift + R and navigate between sections using the document outline on the right of the script.
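
In RStudio a section is simply a comment line that ends with at least four dashes. A sketch of a script laid out this way, with made-up section names and toy values:

```r
# Load data ----
length_of_stay <- c(2, 1, 12, 3)

# Summarise ----
mean_los <- mean(length_of_stay)

# Report ----
print(mean_los)
```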

Organising Your Work


Working within an R Project is a good way to organise your R scripts and to keep all the data and outputs from your work in the same place.

It avoids the need to use setwd() at the start of your scripts, which is not best practice, particularly when collaborating with others.

Organising Your Work


setwd() is typically called with an absolute file path, e.g.

setwd("C:/Users/mfbx9sbk/OneDrive - The University of Manchester/MSc Teaching/coding_best_practice_2")


This can cause problems when you are collaborating with others, as not everyone will have their files organised in the same way.

Organising Your Work


R Projects use relative file paths, which are relative to the working directory of the project.

For example, you might want to save a cleaned version of your data or a plot you have generated.

Organising Your Work


Here the file paths are relative to the Project directory

ggplot(data) +
  geom_point(aes(x = patient_age, y = length_of_stay))
ggsave("figs/age_los_scatter.png")


write_csv(data, "data/trust_los_clean.csv")


So if you shared the project with another person, it would not matter where they saved it: all the file paths would still work.
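
A minimal base R sketch of saving to a relative path, assuming a made-up data frame and a hypothetical file name (toy_los_clean.csv). Writing to a folder that does not exist yet will fail, so the sketch creates it first.

```r
# Made-up data standing in for the cleaned data set
toy_data <- data.frame(patient_age = c(55, 27), length_of_stay = c(2, 1))

# Create the output folder inside the project if it does not exist yet
dir.create("data", showWarnings = FALSE)

# file.path() builds the relative path with the right separator
out_path <- file.path("data", "toy_los_clean.csv")
write.csv(toy_data, out_path, row.names = FALSE)

# The path resolves against the current working directory,
# which is the project folder when working in an R Project
file.exists(out_path)
```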

Organising Your Work


To set up an R Project go to File -> New Project

Organising Your Work


Use the README.md document to briefly describe your project, including what you have done and what the output is.


Version Control


If you have ever had a bunch of files that look something like this then you may want to consider using a version control system to manage your projects

Version Control


Using a version control system can:

  • help organise your work and keep track of updates and changes
  • make it easier to collaborate with others
  • create a repository that can be shared more widely when a project is complete
  • be difficult to navigate at first but quickly become integrated into your regular workflow

Version Control


The most widely used (in the data science community) software for version control is Git.
Git takes snapshots of all files in a project at a specific time - referred to as a “commit”.
It stores the initial version and any subsequent updated versions that are committed.
It tracks the changes you have made at each commit, which can be viewed using the “git diff” command.

Version Control


GitHub is a complementary hosting platform for your repositories (others are available).
Once updates have been committed to Git they can be “pushed” to GitHub.
Collaborators can then “clone” (or “fork”) a copy of the repository and work on it locally whilst you are also still working on it, by pushing and pulling commits to and from GitHub.

Version Control


What a repository looks like on GitHub

Version Control


I would recommend reading this article, which explains how to use Git and GitHub in more detail.

Version Control


Git can be integrated into RStudio and therefore more easily be incorporated into your workflow.
Once Git is installed, an additional tab will appear in the environment pane, where you can commit and push files.

Version Control


Or you can go into the RStudio terminal tab and type Git commands from there

Version Control


To install Git and connect it to GitHub and RStudio, follow this tutorial by Jenny Bryan. It talks through each setup step and shows how to run basic Git commands.